Hello everyone, welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.2: Introduction to Heterogeneous Parallel Computing. The objective of this lecture is for you to learn the major differences between latency-oriented devices and throughput-oriented devices. Latency devices today means CPU cores, and throughput devices today means GPU cores. You will also learn how and why winning applications are increasingly using both types of devices.

This slide shows a typical system on a chip, or SoC, for mobile phones today. In a typical mobile phone we expect to see between two and four CPU cores, or latency-oriented cores. We also expect to see two to four throughput-oriented cores, or GPU cores. There have also always been a number of hardware intellectual property blocks, or hardware IP blocks, that are used for video decoding, sound processing, and so on, and these IP blocks are becoming more and more programmable as well. Then we have the digital signal processing cores, or DSP cores, and these cores are increasingly being used for computing. There is also an emerging trend of using configurable logic, or configurable cores, to make these SoCs more versatile in terms of computing needs. Finally, we have an increasing amount of on-chip memory that is used by all these different types of cores to minimize their consumption of DRAM access bandwidth. On top of all these different cores and memories there is also a cloud service path, where applications can choose to use cloud services rather than computing on the local cores, if parts of the application are too demanding for the local cores to handle. So in this environment, applications need to deal with a heterogeneous collection of cores, and that is really at the heart of what we are teaching in this class.

The mobile phone domain is not the only domain that has been using heterogeneous parallel computing. In the supercomputing domain we see very similar trends, where the top 500 supercomputers today derive more than half of their computing power from GPU cores. So we are definitely in a time where heterogeneous parallel computing is becoming increasingly important.
The next few slides will present the fundamental reasons why heterogeneous parallel computing is an important trend for applications.

This slide shows that CPUs and GPUs are designed very differently. CPUs are designed as latency-oriented cores, and we show a simplified picture of them on the left-hand side. We show a correspondingly simplified picture of GPUs, or throughput-oriented cores, on the right-hand side. As a higher-level comparison before going into details in the next couple of slides, I want you to remember that CPUs tend to have larger local caches, and GPUs tend to have smaller caches, or local memories. GPUs tend to have a larger number of registers, mostly to support a large number of threads, whereas CPUs tend to have fewer registers and support a smaller number of threads. GPUs tend to have a larger number of SIMD execution units, and CPUs have fewer of those SIMD units. CPUs tend to have sophisticated control logic, and therefore there is a big purple block labelled as control, whereas GPUs tend to have very simple control but a large number of threads to manage. So we have a purple block on the right-hand side that shows a large amount of logic to manage and schedule threads and so on.

Let's go a little bit deeper into the design of each device. As mentioned, CPUs are designed with a latency-oriented philosophy, and in CPU design we typically see three prominent features. The first is that the ALUs, or arithmetic logic units, are designed as powerful logic that generates numerical arithmetic results in very few clock cycles. In typical CPU cores today, for 64-bit double-precision floating-point operations, the addition and multiplication operations on these double-precision values typically take only between one and three clock cycles. These clocks also run at very high frequencies, roughly 1.5 to 3 gigahertz. So in terms of the pure time latency of an arithmetic operation, CPUs have extremely short latency in producing these floating-point arithmetic values.
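As a rough back-of-the-envelope illustration of what those numbers mean in wall-clock time, here is a minimal sketch. The 3 GHz clock and the three-cycle latency are just the round figures mentioned in this lecture, not measurements of any particular CPU.

```cuda
// Minimal sketch: turning "1-3 cycles at roughly 3 GHz" into nanoseconds.
// The figures below are the illustrative round numbers from the lecture.
#include <stdio.h>

int main(void) {
    double clock_ghz = 3.0;                      // assumed clock frequency
    double cycle_time_ns = 1.0 / clock_ghz;      // ~0.33 ns per cycle
    double add_latency_cycles = 3.0;             // upper end of the 1-3 cycle range
    printf("one cycle : %.2f ns\n", cycle_time_ns);
    printf("one add   : %.2f ns\n", add_latency_cycles * cycle_time_ns);  // ~1 ns
    return 0;
}
```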
The next prominent feature is that CPUs tend to have large caches, and these large caches are designed to convert long-latency memory accesses into short-latency cache accesses. The strategy is to keep as many data elements as possible in the caches, so that whenever any of the CPU execution units needs to access data, chances are it will find the data in the cache from a previous access. The access latency to this data can therefore be greatly reduced.

The third feature is sophisticated control, and this control logic usually manifests itself in two forms. One is branch prediction support to reduce branch latency. Branch instructions are generated from high-level language constructs such as if-then-else statements and loops. The decision to pick one of the paths of an if-then-else construct, or the decision to determine whether a loop should iterate again or exit, is realized with these control instructions. In most modern microprocessors, the actual decisions may take a long time to make. So having the capability for the hardware to predict which way a branch instruction will go allows us to immediately fetch and execute the instructions in the predicted path. However, we have to allow for the cases where the prediction fails. If the prediction turns out to be incorrect, we need to keep enough of the processor state to recover from the incorrect prediction, and therefore we need all these resources in the control logic to support branch prediction: making the prediction as well as recovering from incorrect predictions. We also have a lot of control logic that is dedicated to data forwarding, which takes care of the situation where the output result of one instruction is needed by subsequent instructions. The data forwarding logic determines where those instructions are in the pipeline and routes the result to them as quickly as possible. This involves a large amount of comparison circuitry as well as routing circuitry.
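As a concrete illustration of the kinds of constructs that exercise this control logic, here is a small, hypothetical C-style function; the name and constants are made up for illustration. The loop test and the if-then-else each become branch instructions the hardware tries to predict, and the result of the multiply feeds directly into the following add, which is exactly the producer-consumer pattern that the data forwarding logic short-circuits.

```cuda
// Hypothetical host-side code, only to show which constructs generate
// branches and data dependences; the function name and constants are made up.
double saturating_sum(const double *x, int n, double limit) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {        // loop test -> a branch the CPU predicts
        double scaled = 2.0 * x[i];      // producer: result of the multiply...
        sum += scaled;                   // ...consumer: forwarded straight into the add
        if (sum > limit) {               // if-then-else -> another predicted branch
            sum = limit;
        }
    }
    return sum;
}
```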
All of this prediction and forwarding circuitry can take a lot of chip area as well as execution power. The bottom line is that all these mechanisms exist to reduce the latency of operations: the first reduces arithmetic calculation latency, the second reduces memory access latency, and the third reduces branch decision latency as well as the latency between the generation of a data value and the execution of all the instructions that require that value as input.

GPUs, on the other hand, are designed as throughput-oriented devices, and GPUs tend to have very small caches. These caches are not used to retain data for future accesses; rather, they are designed as staging units for a large number of threads. If many simultaneously executing threads require the same data, the cache will consolidate all of these requests into one, so that a single request goes to DRAM. When the data comes back, the cache controller serves as forwarding logic that distributes the data to all the execution units, or threads, that require it. So the latency of going to DRAM will still be there; however, this logic allows multiple accesses to be consolidated into a smaller number, so that we conserve the traffic going into DRAM. The second feature is simple control, shown as the small yellow boxes on the left-hand side: in GPUs we typically have no branch prediction, and no data forwarding or very little of it. The third feature is energy-efficient ALUs. Instead of building a small number of very powerful ALUs with very short latency, GPUs usually come with a large number of long-latency but very power-efficient, low-power ALUs. These ALUs are usually heavily pipelined so that they can accept one operation every clock cycle, even though each individual operation takes a long time to produce its result. So at the output of an ALU you will still see one result coming out of the pipeline every clock cycle. Because we have so many ALUs, each with such a long pipeline, we need a large number of threads to keep them all busy.
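To make the "one operation per thread" idea concrete, here is a minimal sketch of the kind of kernel this design favors; the kernel name and parameters are made up for illustration. Each of a very large number of threads performs just one multiply-add, and it is the sheer number of in-flight threads, not the speed of any single one, that keeps the deeply pipelined ALUs producing a result every cycle.

```cuda
// A minimal, hypothetical CUDA kernel: one multiply-add per thread.
// With millions of elements, enough threads are in flight to keep the
// long ALU pipelines full even though each individual operation is slow.
__global__ void scale_and_add(const float *x, const float *y,
                              float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = a * x[i] + y[i];                   // one arithmetic operation per thread
    }
}
```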
Each thread contributes an arithmetic operation that can be executing in one stage of an ALU pipeline, so that we can fully utilize all of the hardware. That is the reason these devices are designed in such a way that, if we have a massive number of threads, each one contributing some operation at the same time, we can fully utilize this large number of ALUs and produce results at very high throughput. However, every thread will take a long time to execute, and every operation will take much longer to execute than its CPU counterpart.

Now that we understand the design philosophy of CPU cores and GPU cores, we should be able to understand why the winning applications today need to use both CPUs and GPUs. We use CPUs to execute the sequential parts of the application, where latency really matters, because in the sequential parts there are going to be very few operations that can be done in parallel. We just need to execute those few operations very quickly, so that we can get through these sections quickly. On the other hand, we use GPUs for the parallel parts, where throughput wins. In those parallel parts we have lots and lots of operations that can be done in parallel, so we use those operations to fully utilize the large number of ALUs in the GPU and execute through those parallel sections extremely quickly. Whenever we use GPUs to execute these parallel sections, they can easily achieve ten or more times higher performance than CPUs. And obviously, CPUs can easily achieve ten or more times the performance of GPUs on the sequential parts of the application.

So naturally, a large number of application domains have already observed this phenomenon, and many of them are using both CPUs and GPUs. In the field, we typically refer to computing that uses both CPUs and GPUs as GPU computing, and this is definitely a growing field.
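A minimal sketch of this division of labor, assuming the hypothetical scale_and_add kernel shown earlier lives in the same source file (error checking omitted, sizes and names made up): the host CPU performs the sequential, latency-sensitive setup, while the data-parallel loop is handed to the GPU as a kernel launch over many threads.

```cuda
// Hypothetical host code sketching the CPU/GPU division of labor.
// The CPU handles the sequential setup; the GPU executes the parallel part.
#include <cuda_runtime.h>

void run_parallel_part(const float *h_x, const float *h_y,
                       float *h_out, float a, int n) {
    size_t bytes = n * sizeof(float);
    float *d_x, *d_y, *d_out;

    // Sequential, latency-sensitive work (allocation, copies) stays on the CPU.
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // The throughput-oriented part: launch enough threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_and_add<<<blocks, threadsPerBlock>>>(d_x, d_y, d_out, a, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_out);
}
```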
In 2010, I edited two volumes of GPU Computing Gems, which included 90 articles on successful uses of GPU computing in different application fields, as well as some of the tools and environments that help application developers use heterogeneous computing successfully in their applications. The list of domains that have successfully used heterogeneous parallel computing is definitely growing; the fields covered by the GPU Computing Gems volumes include financial analysis, scientific simulation, engineering simulation, data-intensive analysis, medical imaging, digital audio processing, digital video processing, computer vision, biomedical informatics, electronic design automation, statistical modeling, numerical methods, ray tracing, rendering, and interactive physics. As you can see, many of these applications are heavily used in our daily lives.

This concludes our lecture on Introduction to Heterogeneous Parallel Computing. At this point you should understand the very different nature of CPUs and GPUs, and this will be an important foundation as we talk about parallel programming of throughput-oriented devices in the next few weeks. If you would like to learn more about the nature of CPUs versus GPUs and some of the more subtle insights about heterogeneous parallel computing, I would encourage you to read chapter one of the textbook by David Kirk and Wen-mei Hwu. Thank you.